**CUNEF**
Practice III - Retail score analysis
Machine Learning</strong>
Pablo Mazariegos Reviriego - pablo.mazariegos@cunef.edu
Mario Sabater Pascual - mario.sabater@cunef.edu

In this Machine Learning practice we will be working with the Yelp dataset. The whole practice will be composed by the following notebooks:
In this notebook we will perform an Exploratory Data Analysis on the restaurants dataset we exported in the Data selection & variables preparation notebook. We will look at the variables correlation, null, non and nan analysis of the variables and if needed we will drop those that will not work for our future models.
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
import requests
import scipy.stats as ss
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 500)
def get_postal_code(lat, lng):
# Replace YOUR_API_KEY with your actual API key
api_key = 'AIzaSyDg-H2ltcfpG5kJ8WLlzCTuNLNuoR4y8us'
endpoint = 'https://maps.googleapis.com/maps/api/geocode/json'
params = {
'latlng': f'{lat},{lng}',
'key': api_key,
}
response = requests.get(endpoint, params=params)
data = response.json()
if data['status'] == 'OK':
results = data['results']
if results:
address = results[0]['formatted_address']
for component in results[0]['address_components']:
if 'postal_code' in component['types']:
postal_code = component['long_name']
return postal_code
return None
As we have identified that there are some missing values in postal_code and not in latitude and longitude, we will use the geocode API from google maps to identify the postal code of such location.
def get_corr_matrix(dataset=None, metodo='spearman', size_figure=[10,8]):
# Para obtener la correlación de Spearman, sólo cambiar el metodo por 'spearman'
if dataset is None:
print(u'\nHace falta pasar argumentos a la función')
return 1
sns.set(style="white")
# Compute the correlation matrix
corr = dataset.corr(method=metodo)
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=size_figure)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, #center=0,
square=True, linewidths=.5, cmap='coolwarm', vmin=-1, vmax=1) #cbar_kws={"shrink": .5}
plt.show()
During the EDA, one of the main focuses is the correlation beween variables. The function get_corr_matrix allows us to do this by computing the correlation matrix of the input dataframe using the specified method, and it sets the self-correlation to zero, and plots the resulting matrix as a heatmap using the seaborn library.
The funtion also has the method for the correlation as spearman and the function displays the plot using matplotlib.pyplot.show()
def cramers_v(var1,var2):
"""
calculate Cramers V statistic for categorial-categorial association.
uses correction from Bergsma and Wicher,
Journal of the Korean Statistical Society 42 (2013): 323-328
confusion_matrix: tabla creada con pd.crosstab()
"""
crosstab =np.array(pd.crosstab(var1,var2, rownames=None, colnames=None))
chi2 = ss.chi2_contingency(crosstab)[0]
n = crosstab.sum()
phi2 = chi2 / n
r, k = crosstab.shape
phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
rcorr = r - ((r-1)**2)/(n-1)
kcorr = k - ((k-1)**2)/(n-1)
return np.sqrt(phi2corr / min((kcorr-1),(rcorr-1)))
The previous function facilitates us to plot the spearman and pearson correlation matrix. The cramers_v function rather than plotting creates the correlation matrix using the cramers v method.
def plot_feature(df, col_name, isContinuous, target):
"""
Visualize a variable with and without faceting on the loan status.
- df dataframe
- col_name is the variable name in the dataframe
- full_name is the full variable name
- continuous is True if the variable is continuous, False otherwise
"""
f, (ax1, ax2) = plt.subplots(nrows = 1, ncols = 2, figsize = (12,3), dpi = 90)
count_null = df[col_name].isnull().sum()
if isContinuous:
sns.histplot(df.loc[df[col_name].notnull(), col_name], kde = False, ax = ax1)
else:
sns.countplot(x=df[col_name], order=None,
color = '#5975A4', saturation = 1, ax = ax1)
ax1.set_xlabel(col_name)
ax1.set_ylabel('Count')
ax1.set_title(col_name+ ' Numero de nulos: ' + str(count_null))
plt.xticks(rotation = 90)
if isContinuous:
sns.boxplot(x = col_name, y = target, data = df, ax = ax2)
ax2.set_ylabel('')
ax2.set_title(col_name + ' by '+ target)
else:
data = df.groupby(col_name)[target].value_counts(normalize=True).to_frame('proportion').reset_index()
data.columns = [i, target, 'proportion']
sns.barplot(x = col_name, y = 'proportion', hue = target, data = data, saturation = 1, ax = ax2)
ax2.set_ylabel(target + ' fraction')
ax2.set_title(target)
plt.xticks(rotation = 90)
ax2.set_xlabel(col_name)
plt.tight_layout()
Being a supervised machine learning problem, the plot_feature allows us to look at the rest of the variables depending on the target variable value. The function creates a figure with two subplots.
The first subplot is a histogram or countplot of the values in the specified column, depending on whether the column is continuous or categorical.
The second subplot is either a boxplot or barplot showing the relationship between the values in the specified column and the target variable.
The function also displays the number of null values in the column in the title of the first subplot. Finally, the function rotates the x-axis labels for both subplots by 90 degrees and adjusts the layout of the figure using plt.tight_layout().
The data we will load for our Exploratory Data Analysis is the csv generated from the R Studio file. This is due to, such csv file contains the information selected in the previous notebook (1.Data selection & variables preparation) as well as the average income by zipcode in 2018 from The USCB (United States Census Bureau) though an API.
Also as we saw in the first notebook, the reviews were made from 2004 to 2018, hence the 2018 average income data is representative.
pd.set_option('display.max_columns', 500)
restaurants = pd.read_csv('../data/processed/restaurants.csv', dtype={'postal_code': str})
restaurants.head(3)
| business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | restaurants | food | bars | sandwiches | american (traditional) | pizza | coffee & tea | breakfast & brunch | american (new) | fast food | burgers | mexican | italian | specialty food | seafood | desserts | bakeries | salad | chinese | cafes | chicken wings | ice cream & frozen yogurt | beer | sports bars | cocktail bars | juice bars & smoothies | barbeque | japanese | pubs | sushi bars | local flavor | asian fusion | diners | steakhouses | mediterranean | wine bars | southern | cajun/creole | donuts | tacos | soup | thai | beer bar | tex-mex | greek | breweries | vietnamese | chicken shop | hot dogs | indian | bagels | comfort food | cheesesteaks | ethnic food | caribbean | middle eastern | french | soul food | buffets | gastropubs | fruits & veggies | korean | bubble tea | tapas/small plates | food stands | cupcakes | noodles | wineries | cuban | spanish | acai bowls | RestaurantsDelivery | OutdoorSeating | BusinessAcceptsCreditCards | BikeParking | RestaurantsPriceRange2 | RestaurantsTakeOut | ByAppointmentOnly | WiFi | Alcohol | Caters | WheelchairAccessible | RestaurantsReservations | CoatCheck | DogsAllowed | RestaurantsTableService | HasTV | HappyHour | DriveThru | BusinessAcceptsBitcoin | Smoking | BYOBCorkage | AcceptsInsurance | RestaurantsCounterService | AgesAllowed | Premium_Parking | Non_Premium_Parking | dairy-free | gluten-free | vegan | kosher | halal | soy-free | vegetarian | num_days_open | num_hours_open | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 1 | 80 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False | False | False | True | 1 | True | False | free | False | True | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | False | True | False | False | False | False | False | False | False | 7.0 | 94.0 |
| 1 | mWMc6_wTdE0EUBKIGXDVfA | Perkiomen Valley Brewery | 101 Walnut St | Green Lane | PA | 18054 | 40.338183 | -75.471659 | 1 | 13 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | unknown | unknown | True | True | 2 | True | unknown | unknown | unknown | False | True | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | False | True | False | False | False | False | False | False | False | 5.0 | 40.0 |
| 2 | CF33F8-E6oudUQ46HnavjQ | Sonic Drive-In | 615 S Main St | Ashland City | TN | 37015 | 36.269593 | -87.058943 | 0 | 6 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | True | True | True | False | 1 | True | False | False | False | False | True | False | False | False | False | True | False | True | unknown | unknown | unknown | unknown | unknown | unknown | False | False | False | False | False | False | False | False | False | 6.0 | 66.0 |
Average_income
We obtained the average income data set through the API from https://www.census.gov/ An official website of the United States government.
The download of this data is done through R. The procedure is attached in the path: '../Rcode/avgincome.R'.
the data set "ZipCode_income" is composed of two columns, GEOID with the zip code of each state of the country and average_income which, as the name of the column says, is the average income per zip code.
income = pd.read_csv('../data/raw/income.csv', index_col=0)
income.head()
| GEOID | average_income | |
|---|---|---|
| 1 | 35004 | 62902.0 |
| 2 | 35005 | 50473.0 |
| 3 | 35006 | 55909.0 |
| 4 | 35007 | 73587.0 |
| 5 | 35010 | 41954.0 |
We join the data set with a left join (so that it only joins the data where there is a restaurant).Then we drop the GEOID column.
income['GEOID'] = income['GEOID'].astype(str)
restaurants_income = pd.merge(restaurants, income, left_on='postal_code', right_on='GEOID', how='left')
restaurants_income = restaurants_income.drop('GEOID', axis=1)
restaurants_income.head()
| business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | restaurants | food | bars | sandwiches | american (traditional) | pizza | coffee & tea | breakfast & brunch | american (new) | fast food | burgers | mexican | italian | specialty food | seafood | desserts | bakeries | salad | chinese | cafes | chicken wings | ice cream & frozen yogurt | beer | sports bars | cocktail bars | juice bars & smoothies | barbeque | japanese | pubs | sushi bars | local flavor | asian fusion | diners | steakhouses | mediterranean | wine bars | southern | cajun/creole | donuts | tacos | soup | thai | beer bar | tex-mex | greek | breweries | vietnamese | chicken shop | hot dogs | indian | bagels | comfort food | cheesesteaks | ethnic food | caribbean | middle eastern | french | soul food | buffets | gastropubs | fruits & veggies | korean | bubble tea | tapas/small plates | food stands | cupcakes | noodles | wineries | cuban | spanish | acai bowls | RestaurantsDelivery | OutdoorSeating | BusinessAcceptsCreditCards | BikeParking | RestaurantsPriceRange2 | RestaurantsTakeOut | ByAppointmentOnly | WiFi | Alcohol | Caters | WheelchairAccessible | RestaurantsReservations | CoatCheck | DogsAllowed | RestaurantsTableService | HasTV | HappyHour | DriveThru | BusinessAcceptsBitcoin | Smoking | BYOBCorkage | AcceptsInsurance | RestaurantsCounterService | AgesAllowed | Premium_Parking | Non_Premium_Parking | dairy-free | gluten-free | vegan | kosher | halal | soy-free | vegetarian | num_days_open | num_hours_open | average_income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MTSW4McQd7CbVtyjqoe9mw | St Honore Pastries | 935 Race St | Philadelphia | PA | 19107 | 39.955505 | -75.155564 | 1 | 80 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False | False | False | True | 1 | True | False | free | False | True | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | False | True | False | False | False | False | False | False | False | 7.0 | 94.0 | 57735.0 |
| 1 | mWMc6_wTdE0EUBKIGXDVfA | Perkiomen Valley Brewery | 101 Walnut St | Green Lane | PA | 18054 | 40.338183 | -75.471659 | 1 | 13 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | unknown | unknown | True | True | 2 | True | unknown | unknown | unknown | False | True | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | False | True | False | False | False | False | False | False | False | 5.0 | 40.0 | 71196.0 |
| 2 | CF33F8-E6oudUQ46HnavjQ | Sonic Drive-In | 615 S Main St | Ashland City | TN | 37015 | 36.269593 | -87.058943 | 0 | 6 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | True | True | True | False | 1 | True | False | False | False | False | True | False | False | False | False | True | False | True | unknown | unknown | unknown | unknown | unknown | unknown | False | False | False | False | False | False | False | False | False | 6.0 | 66.0 | 54097.0 |
| 3 | k0hlBqXX-Bt0vf1op7Jr1w | Tsevi's Pub And Grill | 8025 Mackenzie Rd | Affton | MO | 63123 | 38.565165 | -90.321087 | 0 | 19 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False | True | True | unknown | 1 | True | unknown | free | full_bar | True | unknown | False | unknown | unknown | unknown | True | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | False | True | False | False | False | False | False | False | False | NaN | NaN | 58867.0 |
| 4 | bBDDEgkFA1Otx9Lfe7BZUQ | Sonic Drive-In | 2312 Dickerson Pike | Nashville | TN | 37207 | 36.208102 | -86.768170 | 0 | 10 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | True | True | True | unknown | 1 | True | False | False | False | False | True | False | False | False | False | True | False | True | unknown | unknown | unknown | unknown | unknown | unknown | False | False | False | False | False | False | False | False | False | 6.0 | 76.0 | 42973.0 |
Duplicates observation
restaurants_income. duplicated().sum()
0
The absence of duplicates leave us with a the full length of observations as we had at the beginning, at least for the moment.
Setting business_id as index
Also since the business_id is a unique value for each business, we check if the number of unique values are the same as the length of the dataframe.
len(restaurants_income['business_id'].unique()) - len(restaurants_income)
0
restaurants_income.set_index('business_id', inplace=True)
By setting business_id as index we have it as identifier without loosing any data.
The dataset does not contain duplicated data.
plot_df = restaurants_income['stars']\
.value_counts(normalize = True)\
.mul(100).rename('percent').reset_index()
plot_df_conteo = restaurants_income['stars'].value_counts().reset_index()
plot_df_pc = pd.merge(plot_df, plot_df_conteo,
on = ['index'], how = 'inner')
plot_df_pc
| index | percent | stars | |
|---|---|---|---|
| 0 | 0 | 52.331885 | 26257 |
| 1 | 1 | 47.668115 | 23917 |
# create the barplot
sns.barplot(x='index', y='percent', data=plot_df_pc, palette=['darkred', 'darkgreen'])
# show the plot
plt.show()
We also took a look at this data in the previous notebbok. The current selection of the data (restaurants + the selected states), and the transformation of the stars variable made the target variable one practically totally balanced with 52.2% of the data as 0 (less than 4 stars) and 47.7% as 1 (successfull businesses).
In the previous notebook (1. Data selection & variables preparation) we already managed the missing values of the columns coming from attributes. Now we have four columns that have missing data, postal_code, average_income, num_days_open and num_hours_open.
postal_code
sum(restaurants_income['postal_code'].isna())
23
# Create a new dataframe with only the rows that have NA values in the column postal_code
postal_na = restaurants_income[restaurants_income["postal_code"].isna()]
# Select only the columns longitude and latitude from the new dataframe
postal_na = postal_na[["longitude", "latitude", "postal_code"]]
With the function get_postal_code we are able to obtain the postal code based on the latitude and longitude thanks to the Google Maps API
#Replace NA values in the postal_code column with the result of the get_postal_code function
postal_na['postal_code'] = postal_na[['latitude', 'longitude']].apply(
lambda x: get_postal_code(x['latitude'], x['longitude']), axis=1)
postal_na['postal_code'].astype(str)
postal_na.head(3)
| longitude | latitude | postal_code | |
|---|---|---|---|
| business_id | |||
| siwG4ZM7RjUDO52DI84m3w | -82.640292 | 27.767601 | 33701 |
| --sXnWH9Xm6_NvIjyuA99w | -75.165222 | 39.952584 | 19102 |
| UZ7mpi6qcsS0FqsH-cyfnA | -90.080654 | 29.934027 | 70130 |
We include this postal codes on the datafreame through a merge via left join
# Merge the two dataframes on the business_id column, using the suffixesparameter to specify different suffixes for the columns
merged = pd.merge(restaurants_income, postal_na, on="business_id", how="left", suffixes=("", "_na"))
# Replace the NA values in the postal_code column with the values from the postal_code_na column
merged["postal_code"] = merged["postal_code"].fillna(merged["postal_code_na"])
# Drop the postal_code_na column
restaurants_income = merged.drop(columns=["postal_code_na","longitude_na","latitude_na"])
restaurants_income[restaurants_income["postal_code"].isna()].head()
| name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | restaurants | food | bars | sandwiches | american (traditional) | pizza | coffee & tea | breakfast & brunch | american (new) | fast food | burgers | mexican | italian | specialty food | seafood | desserts | bakeries | salad | chinese | cafes | chicken wings | ice cream & frozen yogurt | beer | sports bars | cocktail bars | juice bars & smoothies | barbeque | japanese | pubs | sushi bars | local flavor | asian fusion | diners | steakhouses | mediterranean | wine bars | southern | cajun/creole | donuts | tacos | soup | thai | beer bar | tex-mex | greek | breweries | vietnamese | chicken shop | hot dogs | indian | bagels | comfort food | cheesesteaks | ethnic food | caribbean | middle eastern | french | soul food | buffets | gastropubs | fruits & veggies | korean | bubble tea | tapas/small plates | food stands | cupcakes | noodles | wineries | cuban | spanish | acai bowls | RestaurantsDelivery | OutdoorSeating | BusinessAcceptsCreditCards | BikeParking | RestaurantsPriceRange2 | RestaurantsTakeOut | ByAppointmentOnly | WiFi | Alcohol | Caters | WheelchairAccessible | RestaurantsReservations | CoatCheck | DogsAllowed | RestaurantsTableService | HasTV | HappyHour | DriveThru | BusinessAcceptsBitcoin | Smoking | BYOBCorkage | AcceptsInsurance | RestaurantsCounterService | AgesAllowed | Premium_Parking | Non_Premium_Parking | dairy-free | gluten-free | vegan | kosher | halal | soy-free | vegetarian | num_days_open | num_hours_open | average_income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| business_id | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| MxSSESMG5NJCDHy_zsmNCw | Fire Monkey Food Truck | NaN | Saint Petersburg | FL | None | 27.751828 | -82.626734 | 1 | 6 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | unknown | unknown | True | unknown | 2 | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | True | True | False | False | False | False | False | False | False | NaN | NaN | NaN |
Now we only have one NA value for postal code. As we see in the name, the business is a food truck, hence it does not have a fix postal code. For this reason we will drop it from our dataset.
restaurants_income = restaurants_income.drop('MxSSESMG5NJCDHy_zsmNCw', axis=0)
sum(restaurants_income['postal_code'].isna())
0
average_income missing values
The dataset provided by the USCB has some missing data.
sum(restaurants_income['average_income'].isna())
538
For these we will replace them with the mean of the column "average_income" of the correspondant state:
# group the dataframe by the state column and calculate the median average_income for each group
median_income_by_state = restaurants_income.groupby('state')['average_income'].median()
# fill the missing values in the average_income column with the median income for the state
restaurants_income['average_income'] = restaurants_income['average_income'].fillna(
restaurants_income['state'].map(median_income_by_state))
sum(restaurants_income['average_income'].isna())
0
num_days_open & num_hours_open missing values
For these two variables we will replace the missing value with the median value of each column,
sum(restaurants_income['num_days_open'].isna())
7394
sum(restaurants_income['num_hours_open'].isna())
7394
# Calculate the medians of both columns
median_days_open = restaurants_income['num_days_open'].median()
median_hours_open = restaurants_income['num_hours_open'].median()
# Use the fillna method to fill the NA values with the medians
restaurants_income = restaurants_income.fillna({
'num_days_open': median_days_open,
'num_hours_open': median_hours_open
})
sum(restaurants_income['num_hours_open'].isna())
0
sum(restaurants_income['num_days_open'].isna())
0
Looking at the variables there is one possible column with outliers. The average_income. Besides that, the variables we have left are categorical or in a small range of numbers, hence the existence of outliers is nothing but difficult.
average_income
sns.boxplot(x='stars', y='average_income', data=restaurants_income)
<AxesSubplot: xlabel='stars', ylabel='average_income'>
Despite we observe some outlier values, especially from $150.000 of avergae income, we observe these are balanced between the values of the target variable. Due to the nature of the average_income and the balanced values, we will not take further action.
Spearman
df_spearman = restaurants_income[['name', 'address', 'city', 'state', 'postal_code', 'latitude', 'longitude',
'is_open', 'restaurants', 'food', 'bars', 'sandwiches', 'american (traditional)',
'pizza', 'coffee & tea', 'breakfast & brunch', 'american (new)', 'fast food', 'burgers',
'mexican', 'italian', 'specialty food', 'seafood', 'desserts', 'bakeries', 'salad', 'chinese',
'cafes', 'chicken wings', 'ice cream & frozen yogurt', 'beer', 'sports bars', 'cocktail bars',
'juice bars & smoothies', 'barbeque', 'japanese', 'pubs', 'sushi bars', 'local flavor',
'asian fusion', 'diners', 'steakhouses', 'mediterranean', 'wine bars', 'vegetarian',
'southern', 'cajun/creole', 'donuts', 'tacos', 'soup', 'thai', 'beer bar', 'tex-mex', 'vegan',
'greek', 'breweries', 'vietnamese', 'chicken shop', 'hot dogs', 'indian', 'gluten-free',
'bagels', 'comfort food', 'cheesesteaks', 'ethnic food', 'caribbean', 'middle eastern',
'french', 'soul food', 'buffets', 'gastropubs', 'fruits & veggies', 'korean', 'bubble tea',
'tapas/small plates', 'food stands', 'cupcakes', 'noodles', 'wineries', 'cuban', 'halal',
'spanish', 'acai bowls', 'RestaurantsDelivery', 'OutdoorSeating', 'BusinessAcceptsCreditCards',
'BikeParking', 'RestaurantsPriceRange2', 'RestaurantsTakeOut', 'ByAppointmentOnly', 'WiFi',
'Alcohol', 'Caters', 'WheelchairAccessible', 'RestaurantsReservations', 'CoatCheck',
'DogsAllowed', 'RestaurantsTableService', 'HasTV', 'HappyHour', 'DriveThru',
'BusinessAcceptsBitcoin', 'Smoking', 'BYOBCorkage', 'AcceptsInsurance',
'RestaurantsCounterService', 'AgesAllowed', 'Premium_Parking', 'Non_Premium_Parking',
'review_count', 'num_days_open', 'num_hours_open', 'average_income']]
First, in order to know the correlation between variables, we will plot a heatmat of the spearman correlations between them. For that we will call the funtion _get_corrmatrix from the functions notebook. The function will only call for numerical variables. After, we will do a Cramer's V for all categorical variables (despite having numeric or string values).
get_corr_matrix(dataset = df_spearman, size_figure = [10,8])
spearman = df_spearman.corr(method = 'spearman')
spearman
| latitude | longitude | is_open | restaurants | food | bars | sandwiches | american (traditional) | pizza | coffee & tea | breakfast & brunch | american (new) | fast food | burgers | mexican | italian | specialty food | seafood | desserts | bakeries | salad | chinese | cafes | chicken wings | ice cream & frozen yogurt | beer | sports bars | cocktail bars | juice bars & smoothies | barbeque | japanese | pubs | sushi bars | local flavor | asian fusion | diners | steakhouses | mediterranean | wine bars | vegetarian | southern | cajun/creole | donuts | tacos | soup | thai | beer bar | tex-mex | vegan | greek | breweries | vietnamese | chicken shop | hot dogs | indian | gluten-free | bagels | comfort food | cheesesteaks | ethnic food | caribbean | middle eastern | french | soul food | buffets | gastropubs | fruits & veggies | korean | bubble tea | tapas/small plates | food stands | cupcakes | noodles | wineries | cuban | halal | spanish | acai bowls | RestaurantsPriceRange2 | review_count | num_days_open | num_hours_open | average_income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| latitude | 1.000000 | 0.325075 | -0.016491 | 0.027388 | 0.001211 | -0.041527 | 0.008887 | -0.007611 | 0.063944 | 0.011912 | -0.011915 | 0.005900 | -0.028787 | -0.010872 | -0.026744 | 0.041775 | -0.001523 | -0.063391 | -0.005809 | 0.019454 | -0.002704 | 0.039429 | -0.020700 | -0.033794 | -0.000629 | -0.003186 | -0.027068 | -0.022668 | -0.017733 | -0.010750 | 0.021157 | 0.015554 | -0.012032 | -0.028461 | 0.014077 | 0.004034 | -0.008005 | 0.006200 | -0.024033 | 0.003997 | -0.036719 | -0.066683 | 0.003028 | -0.026553 | 0.002496 | -0.004158 | -0.006759 | -0.016743 | 0.002925 | -0.033908 | 0.006665 | -0.002366 | -0.017584 | -0.009597 | 0.032477 | 0.002528 | 0.023600 | -0.005653 | 0.057046 | 0.001777 | -0.027089 | 0.005478 | 0.007847 | -0.004826 | -0.000143 | 0.019532 | 0.001137 | 0.033884 | 0.004235 | -0.012490 | 0.002722 | 0.000773 | 0.009050 | -0.001909 | -0.074600 | -0.004028 | -0.027914 | -0.019389 | 0.011775 | -0.038601 | 0.000701 | 0.006403 | 0.359561 |
| longitude | 0.325075 | 1.000000 | -0.016811 | 0.033527 | -0.006060 | -0.030242 | 0.018106 | -0.011439 | 0.066910 | 0.015464 | -0.014608 | -0.003613 | -0.072733 | -0.044886 | -0.081832 | 0.055422 | 0.004197 | 0.001256 | -0.009048 | 0.001645 | -0.006184 | 0.032571 | -0.000295 | -0.009245 | -0.001785 | -0.018407 | -0.016326 | -0.034118 | -0.005286 | -0.021464 | 0.022974 | 0.014063 | 0.011056 | -0.032743 | 0.019378 | 0.006938 | -0.004152 | 0.012926 | -0.035545 | -0.000707 | -0.026399 | -0.060123 | 0.025470 | -0.038881 | 0.001267 | 0.019138 | -0.013772 | -0.021209 | -0.006663 | -0.000599 | -0.007127 | 0.004681 | 0.000043 | -0.008254 | 0.037610 | -0.002816 | 0.024452 | -0.001551 | 0.082388 | 0.010897 | 0.032513 | 0.013511 | -0.001033 | 0.005380 | -0.003452 | 0.015527 | 0.018506 | 0.021453 | 0.012110 | -0.008728 | -0.001120 | 0.002329 | 0.011214 | -0.040830 | 0.016522 | -0.003285 | 0.019574 | 0.000229 | 0.022546 | -0.043040 | 0.032250 | 0.012135 | 0.215675 |
| is_open | -0.016491 | -0.016811 | 1.000000 | -0.099063 | 0.078308 | -0.005702 | 0.027231 | -0.003834 | 0.014837 | 0.033827 | 0.013118 | -0.062750 | 0.108176 | 0.046277 | 0.011828 | -0.026212 | 0.021357 | 0.003068 | -0.003407 | 0.001332 | 0.022772 | 0.012327 | -0.027402 | 0.043725 | -0.010118 | 0.019261 | 0.003912 | 0.029781 | 0.014090 | -0.042038 | -0.017849 | -0.013137 | -0.011691 | 0.038514 | -0.022316 | -0.015380 | -0.004589 | -0.038714 | -0.003470 | -0.003672 | -0.010611 | -0.020656 | 0.037624 | 0.046675 | 0.014439 | -0.017961 | 0.029283 | 0.016692 | 0.000792 | -0.025627 | 0.021372 | -0.025252 | 0.042816 | -0.000276 | -0.038791 | -0.000546 | 0.000922 | 0.004559 | 0.001948 | -0.011572 | -0.005214 | -0.024092 | -0.045955 | -0.011852 | -0.041484 | -0.011695 | 0.025531 | -0.018817 | 0.010778 | -0.016890 | -0.003457 | 0.006251 | 0.017769 | 0.008283 | -0.009802 | 0.003041 | -0.016603 | 0.012109 | -0.019856 | 0.147521 | -0.041420 | 0.050457 | -0.023151 |
| restaurants | 0.027388 | 0.033527 | -0.099063 | 1.000000 | -0.416698 | -0.022544 | 0.206131 | 0.202098 | 0.180776 | -0.123121 | 0.176980 | 0.176369 | 0.170524 | 0.159859 | 0.145903 | 0.145114 | -0.181836 | 0.132766 | -0.101108 | -0.079704 | 0.118613 | 0.117191 | 0.115171 | 0.114753 | -0.178104 | -0.129852 | 0.032582 | 0.020348 | -0.060231 | 0.089521 | 0.089278 | 0.000583 | 0.087808 | -0.183017 | 0.083038 | 0.080405 | 0.079753 | 0.077847 | 0.007545 | 0.007275 | 0.071257 | 0.070009 | -0.101027 | 0.067985 | 0.067717 | 0.066315 | -0.012681 | 0.064088 | 0.006416 | 0.061345 | -0.079923 | 0.059449 | 0.059196 | 0.058170 | 0.057755 | 0.008401 | 0.008634 | 0.056279 | 0.053436 | 0.006268 | 0.052872 | 0.052645 | 0.050201 | 0.050022 | 0.048995 | 0.048505 | -0.110171 | 0.046494 | -0.038124 | 0.046108 | 0.045193 | -0.096098 | 0.040173 | -0.093867 | 0.040099 | 0.002425 | 0.038742 | -0.015759 | -0.044820 | 0.259518 | -0.044256 | -0.094991 | 0.001212 |
| food | 0.001211 | -0.006060 | 0.078308 | -0.416698 | 1.000000 | -0.183220 | -0.007051 | -0.144637 | -0.152022 | 0.393910 | -0.006328 | -0.100065 | -0.068056 | -0.036670 | -0.116095 | -0.107865 | 0.307835 | -0.096580 | 0.270057 | 0.261544 | -0.013312 | -0.136172 | 0.087565 | -0.107508 | 0.242164 | 0.231579 | -0.093200 | -0.057435 | 0.197325 | -0.050810 | -0.083570 | -0.046726 | -0.081191 | -0.035714 | -0.055632 | -0.065738 | -0.087344 | -0.049839 | -0.005708 | -0.005238 | -0.049943 | -0.052708 | 0.152506 | -0.043367 | -0.001855 | -0.062465 | 0.007830 | -0.071902 | -0.003097 | -0.050494 | 0.131090 | -0.028963 | -0.056097 | 0.003794 | -0.032220 | 0.005288 | 0.125369 | -0.015441 | -0.030538 | 0.115854 | -0.029956 | -0.023907 | -0.024862 | -0.034143 | -0.047029 | -0.019500 | 0.104430 | -0.038287 | 0.101521 | -0.028770 | 0.043011 | 0.094243 | -0.014992 | 0.087677 | -0.009282 | -0.003759 | -0.023586 | 0.081987 | -0.056956 | -0.123659 | -0.035687 | 0.065196 | 0.030264 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| RestaurantsPriceRange2 | 0.011775 | 0.022546 | -0.019856 | -0.044820 | -0.056956 | 0.174575 | -0.149228 | 0.055691 | -0.059063 | -0.117828 | -0.043575 | 0.134983 | -0.299495 | -0.117304 | -0.066565 | 0.059230 | 0.043334 | 0.133241 | 0.001894 | -0.012124 | 0.010729 | -0.047298 | -0.020085 | -0.072414 | -0.101493 | 0.071743 | 0.045963 | 0.108846 | -0.007134 | 0.051308 | 0.082806 | 0.033375 | 0.095492 | 0.053476 | 0.057409 | -0.054332 | 0.116241 | 0.026844 | 0.104762 | 0.006891 | 0.046667 | 0.063355 | -0.110134 | -0.042771 | -0.001802 | 0.054308 | 0.037587 | -0.048443 | 0.000173 | -0.012945 | 0.027752 | -0.014631 | -0.057228 | -0.061679 | 0.040224 | 0.001915 | -0.065194 | 0.012942 | -0.031450 | -0.029234 | 0.014855 | 0.003793 | 0.090281 | 0.020436 | 0.031339 | 0.049660 | -0.021295 | 0.022149 | -0.013270 | 0.065606 | -0.042844 | 0.021478 | 0.019894 | 0.046463 | -0.021933 | -0.005680 | 0.011003 | 0.025986 | 1.000000 | 0.125927 | -0.124530 | -0.149190 | 0.090668 |
| review_count | -0.038601 | -0.043040 | 0.147521 | 0.259518 | -0.123659 | 0.180616 | 0.021593 | 0.149602 | 0.016604 | -0.014967 | 0.142104 | 0.173294 | -0.107139 | 0.023072 | 0.052943 | 0.087915 | -0.053072 | 0.127676 | 0.035722 | 0.000167 | 0.095399 | 0.016357 | 0.042472 | -0.023786 | -0.036865 | -0.000203 | 0.054833 | 0.119406 | -0.012987 | 0.042282 | 0.091009 | 0.070931 | 0.105363 | -0.025446 | 0.081959 | 0.054794 | 0.102339 | 0.054929 | 0.087012 | 0.012734 | 0.077455 | 0.064167 | -0.030412 | 0.002299 | 0.055264 | 0.071608 | 0.061373 | 0.025700 | 0.005973 | 0.033859 | 0.048662 | 0.051966 | -0.002811 | -0.009589 | 0.039678 | 0.013862 | 0.021912 | 0.054818 | 0.012933 | 0.013523 | -0.002015 | 0.034317 | 0.058017 | 0.000185 | 0.022690 | 0.077928 | -0.036513 | 0.038349 | 0.026013 | 0.061668 | -0.028448 | -0.005197 | 0.049837 | 0.004288 | 0.015833 | 0.006506 | 0.020011 | 0.000339 | 0.125927 | 1.000000 | -0.168173 | -0.122808 | 0.053423 |
| num_days_open | 0.000701 | 0.032250 | -0.041420 | -0.044256 | -0.035687 | -0.022900 | -0.031118 | -0.005622 | 0.047547 | 0.018008 | -0.031066 | -0.057206 | 0.070653 | 0.034801 | 0.017350 | -0.032967 | -0.056696 | -0.038864 | -0.078001 | -0.060153 | -0.064256 | 0.041239 | -0.064535 | 0.041727 | 0.015450 | -0.014619 | 0.033225 | -0.068469 | -0.042390 | -0.055032 | -0.016986 | 0.016189 | -0.015693 | -0.006267 | -0.035684 | 0.025539 | -0.014630 | -0.027978 | -0.068212 | -0.007758 | -0.058011 | -0.024884 | 0.044336 | 0.004348 | -0.032706 | -0.038574 | -0.036578 | 0.034141 | -0.010016 | -0.004536 | -0.047782 | -0.029459 | -0.015652 | -0.007527 | -0.015011 | -0.009248 | 0.013522 | -0.045215 | -0.001421 | -0.004102 | -0.040617 | -0.009828 | -0.055770 | -0.049740 | 0.025247 | -0.014441 | -0.012133 | -0.020854 | -0.011974 | -0.044361 | -0.025905 | -0.052264 | -0.033667 | -0.037557 | -0.031569 | -0.004547 | -0.029043 | -0.027323 | -0.124530 | -0.168173 | 1.000000 | 0.537661 | -0.008376 |
| num_hours_open | 0.006403 | 0.012135 | 0.050457 | -0.094991 | 0.065196 | -0.069278 | 0.005760 | -0.021950 | 0.009873 | 0.110445 | -0.032414 | -0.071355 | 0.100346 | 0.032577 | 0.032558 | -0.049348 | -0.031610 | -0.047503 | -0.067458 | -0.028276 | -0.023859 | 0.059086 | -0.060501 | 0.022808 | 0.012618 | -0.003444 | -0.017836 | -0.050506 | 0.007645 | -0.045579 | -0.040901 | -0.017081 | -0.044865 | -0.048343 | -0.027753 | 0.036319 | -0.053329 | -0.028734 | -0.073837 | -0.002338 | -0.040655 | -0.025070 | 0.093133 | -0.007760 | -0.002420 | -0.052905 | -0.042420 | -0.007357 | -0.002805 | -0.007931 | -0.065599 | -0.008813 | 0.031857 | -0.019607 | -0.047446 | -0.002684 | 0.007164 | -0.015508 | 0.002523 | -0.002797 | -0.025457 | -0.005421 | -0.066911 | -0.046572 | 0.007105 | -0.019985 | 0.007983 | -0.017693 | -0.005085 | -0.048812 | -0.035907 | -0.037912 | -0.015266 | -0.050083 | -0.017790 | -0.001741 | -0.022737 | -0.002260 | -0.149190 | -0.122808 | 0.537661 | 1.000000 | -0.000869 |
| average_income | 0.359561 | 0.215675 | -0.023151 | 0.001212 | 0.030264 | -0.021414 | 0.019847 | 0.014241 | 0.039720 | 0.003601 | -0.001124 | 0.030801 | -0.033753 | -0.017167 | -0.031418 | 0.068954 | 0.007227 | -0.024857 | 0.021077 | 0.025763 | 0.028924 | -0.005995 | 0.004635 | -0.044579 | 0.039999 | 0.012880 | -0.002241 | -0.015522 | 0.015229 | -0.008137 | 0.031638 | 0.014027 | 0.030476 | -0.029968 | 0.019031 | -0.006040 | 0.018106 | 0.009765 | 0.009732 | 0.010919 | -0.020076 | -0.070676 | 0.003136 | -0.021179 | 0.015970 | 0.023980 | -0.002817 | -0.002368 | 0.007632 | -0.002377 | -0.000028 | -0.027073 | -0.018937 | -0.008414 | 0.017024 | 0.009383 | 0.030172 | -0.008955 | 0.018234 | -0.019367 | -0.059040 | -0.017895 | 0.017151 | -0.059997 | -0.010188 | 0.012830 | 0.010252 | 0.000171 | -0.018902 | 0.002609 | -0.033796 | 0.022176 | -0.001447 | 0.014389 | -0.031961 | 0.001174 | -0.023642 | 0.023190 | 0.090668 | 0.053423 | -0.008376 | -0.000869 | 1.000000 |
83 rows × 83 columns
After checking the results, the only correlating higher than 0.6 in absolute termns is the correlation between the variables vegan and vegetarian. Since all vegan restaurants can also be vonsider vegetarian, as well as, many vegetarian restaurants also provides a huge amount of vegan plates they can get the "vegan" attribute. This variables have a correlation of 0.755891.
Cramer's V
First we define the categorical variables (excluding the target variable)
df_categorical_variables = restaurants_income[['city', 'state', 'postal_code',
'is_open', 'restaurants', 'food', 'bars', 'sandwiches', 'american (traditional)',
'pizza', 'coffee & tea', 'breakfast & brunch', 'american (new)', 'fast food', 'burgers',
'mexican', 'italian', 'specialty food', 'seafood', 'desserts', 'bakeries', 'salad', 'chinese',
'cafes', 'chicken wings', 'ice cream & frozen yogurt', 'beer', 'sports bars', 'cocktail bars',
'juice bars & smoothies', 'barbeque', 'japanese', 'pubs', 'sushi bars', 'local flavor',
'asian fusion', 'diners', 'steakhouses', 'mediterranean', 'wine bars', 'vegetarian',
'southern', 'cajun/creole', 'donuts', 'tacos', 'soup', 'thai', 'beer bar', 'tex-mex', 'vegan',
'greek', 'breweries', 'vietnamese', 'chicken shop', 'hot dogs', 'indian', 'gluten-free',
'bagels', 'comfort food', 'cheesesteaks', 'ethnic food', 'caribbean', 'middle eastern',
'french', 'soul food', 'buffets', 'gastropubs', 'fruits & veggies', 'korean', 'bubble tea',
'tapas/small plates', 'food stands', 'cupcakes', 'noodles', 'wineries', 'cuban', 'halal',
'spanish', 'acai bowls', 'RestaurantsDelivery', 'OutdoorSeating', 'BusinessAcceptsCreditCards',
'BikeParking', 'RestaurantsPriceRange2', 'RestaurantsTakeOut', 'ByAppointmentOnly', 'WiFi',
'Alcohol', 'Caters', 'WheelchairAccessible', 'RestaurantsReservations', 'CoatCheck',
'DogsAllowed', 'RestaurantsTableService', 'HasTV', 'HappyHour', 'DriveThru',
'BusinessAcceptsBitcoin', 'Smoking', 'BYOBCorkage', 'AcceptsInsurance',
'RestaurantsCounterService', 'AgesAllowed', 'Premium_Parking', 'Non_Premium_Parking']]
Then, we call from the fucntions auxiliar notebook the _cramersv function, and we plot its heatmap
# Initialize an empty list to store the results of the Cramer's V test
rows = []
# Loop over all pairs of categorical variables in df_categorical_variables
for var1 in df_categorical_variables:
col = []
for var2 in df_categorical_variables:
# Calculate Cramer's V statistic for the current pair of variables
cramers = cramers_v(df_categorical_variables[var1], df_categorical_variables[var2])
# Round the Cramer's V statistic to two decimal places and add it to the list col
col.append(round(cramers, 2))
# Add the list col to the list rows
rows.append(col)
# Convert the list of lists to a NumPy array and use it to create a new dataframe df_vcramer
cramers_results = np.array(rows)
df_vcramer = pd.DataFrame(cramers_results, columns=df_categorical_variables.columns, index=df_categorical_variables.columns)
# Plot a heatmap of the p-values in df_vcramer using seaborn
sns.heatmap(df_vcramer, vmin=0, vmax=1, square=True, linewidths=0.5, cmap='coolwarm')
<AxesSubplot: >
df_vcramer
| city | state | postal_code | is_open | restaurants | food | bars | sandwiches | american (traditional) | pizza | coffee & tea | breakfast & brunch | american (new) | fast food | burgers | mexican | italian | specialty food | seafood | desserts | bakeries | salad | chinese | cafes | chicken wings | ice cream & frozen yogurt | beer | sports bars | cocktail bars | juice bars & smoothies | barbeque | japanese | pubs | sushi bars | local flavor | asian fusion | diners | steakhouses | mediterranean | wine bars | vegetarian | southern | cajun/creole | donuts | tacos | soup | thai | beer bar | tex-mex | vegan | greek | breweries | vietnamese | chicken shop | hot dogs | indian | gluten-free | bagels | comfort food | cheesesteaks | ethnic food | caribbean | middle eastern | french | soul food | buffets | gastropubs | fruits & veggies | korean | bubble tea | tapas/small plates | food stands | cupcakes | noodles | wineries | cuban | halal | spanish | acai bowls | RestaurantsDelivery | OutdoorSeating | BusinessAcceptsCreditCards | BikeParking | RestaurantsPriceRange2 | RestaurantsTakeOut | ByAppointmentOnly | WiFi | Alcohol | Caters | WheelchairAccessible | RestaurantsReservations | CoatCheck | DogsAllowed | RestaurantsTableService | HasTV | HappyHour | DriveThru | BusinessAcceptsBitcoin | Smoking | BYOBCorkage | AcceptsInsurance | RestaurantsCounterService | AgesAllowed | Premium_Parking | Non_Premium_Parking | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| city | 1.00 | 0.99 | 0.65 | 0.12 | 0.09 | 0.08 | 0.13 | 0.05 | 0.09 | 0.12 | 0.03 | 0.05 | 0.08 | 0.18 | 0.13 | 0.12 | 0.11 | 0.04 | 0.15 | 0.04 | 0.04 | 0.04 | 0.04 | 0.02 | 0.10 | 0.08 | 0.00 | 0.04 | 0.05 | 0.00 | 0.07 | 0.00 | 0.05 | 0.00 | 0.11 | 0.00 | 0.09 | 0.06 | 0.08 | 0.07 | 0.02 | 0.11 | 0.24 | 0.00 | 0.07 | 0.02 | 0.03 | 0.00 | 0.00 | 0.00 | 0.09 | 0.07 | 0.04 | 0.00 | 0.04 | 0.10 | 0.00 | 0.00 | 0.06 | 0.12 | 0.02 | 0.06 | 0.05 | 0.08 | 0.05 | 0.00 | 0.00 | 0.08 | 0.09 | 0.00 | 0.00 | 0.10 | 0.07 | 0.00 | 0.17 | 0.11 | 0.00 | 0.03 | 0.00 | 0.12 | 0.16 | 0.13 | 0.14 | 0.10 | 0.08 | 0.06 | 0.07 | 0.19 | 0.08 | 0.10 | 0.10 | 0.08 | 0.13 | 0.08 | 0.09 | 0.10 | 0.11 | 0.03 | 0.06 | 0.02 | 0.15 | 0.00 | 0.00 | 0.13 | 0.11 |
| state | 0.99 | 1.00 | 0.99 | 0.05 | 0.06 | 0.05 | 0.06 | 0.02 | 0.05 | 0.09 | 0.03 | 0.02 | 0.03 | 0.11 | 0.06 | 0.12 | 0.07 | 0.02 | 0.10 | 0.01 | 0.03 | 0.02 | 0.05 | 0.02 | 0.05 | 0.01 | 0.04 | 0.04 | 0.05 | 0.03 | 0.05 | 0.03 | 0.04 | 0.03 | 0.07 | 0.02 | 0.02 | 0.02 | 0.01 | 0.05 | 0.00 | 0.13 | 0.23 | 0.03 | 0.05 | 0.02 | 0.03 | 0.04 | 0.04 | 0.00 | 0.05 | 0.03 | 0.04 | 0.03 | 0.03 | 0.04 | 0.01 | 0.03 | 0.01 | 0.09 | 0.00 | 0.05 | 0.02 | 0.05 | 0.04 | 0.02 | 0.02 | 0.02 | 0.03 | 0.01 | 0.02 | 0.02 | 0.00 | 0.01 | 0.12 | 0.10 | 0.01 | 0.04 | 0.03 | 0.08 | 0.10 | 0.09 | 0.09 | 0.05 | 0.05 | 0.05 | 0.04 | 0.17 | 0.04 | 0.08 | 0.05 | 0.06 | 0.10 | 0.04 | 0.06 | 0.05 | 0.07 | 0.03 | 0.05 | 0.07 | 0.01 | 0.02 | 0.02 | 0.05 | 0.04 |
| postal_code | 0.65 | 0.99 | 1.00 | 0.17 | 0.10 | 0.10 | 0.21 | 0.05 | 0.10 | 0.13 | 0.03 | 0.05 | 0.11 | 0.20 | 0.10 | 0.15 | 0.12 | 0.07 | 0.15 | 0.06 | 0.06 | 0.04 | 0.09 | 0.06 | 0.14 | 0.03 | 0.05 | 0.06 | 0.12 | 0.05 | 0.08 | 0.02 | 0.10 | 0.00 | 0.15 | 0.03 | 0.05 | 0.07 | 0.05 | 0.09 | 0.04 | 0.13 | 0.26 | 0.03 | 0.04 | 0.00 | 0.07 | 0.07 | 0.00 | 0.00 | 0.04 | 0.09 | 0.10 | 0.00 | 0.03 | 0.11 | 0.00 | 0.05 | 0.00 | 0.11 | 0.07 | 0.11 | 0.04 | 0.09 | 0.12 | 0.07 | 0.00 | 0.09 | 0.16 | 0.05 | 0.06 | 0.11 | 0.02 | 0.00 | 0.18 | 0.12 | 0.00 | 0.09 | 0.00 | 0.15 | 0.21 | 0.16 | 0.19 | 0.14 | 0.11 | 0.07 | 0.09 | 0.22 | 0.11 | 0.13 | 0.13 | 0.10 | 0.17 | 0.09 | 0.12 | 0.13 | 0.12 | 0.08 | 0.11 | 0.09 | 0.00 | 0.14 | 0.00 | 0.24 | 0.16 |
| is_open | 0.12 | 0.05 | 0.17 | 1.00 | 0.10 | 0.08 | 0.00 | 0.03 | 0.00 | 0.01 | 0.03 | 0.01 | 0.06 | 0.11 | 0.05 | 0.01 | 0.03 | 0.02 | 0.00 | 0.00 | 0.00 | 0.02 | 0.01 | 0.03 | 0.04 | 0.01 | 0.02 | 0.00 | 0.03 | 0.01 | 0.04 | 0.02 | 0.01 | 0.01 | 0.04 | 0.02 | 0.01 | 0.00 | 0.04 | 0.00 | 0.00 | 0.01 | 0.02 | 0.04 | 0.05 | 0.01 | 0.02 | 0.03 | 0.02 | 0.00 | 0.03 | 0.02 | 0.02 | 0.04 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.02 | 0.05 | 0.01 | 0.04 | 0.01 | 0.02 | 0.02 | 0.01 | 0.02 | 0.00 | 0.00 | 0.02 | 0.01 | 0.01 | 0.00 | 0.02 | 0.01 | 0.27 | 0.13 | 0.10 | 0.13 | 0.04 | 0.06 | 0.11 | 0.03 | 0.13 | 0.06 | 0.12 | 0.14 | 0.06 | 0.15 | 0.08 | 0.16 | 0.10 | 0.17 | 0.05 | 0.04 | 0.04 | 0.01 | 0.01 | 0.02 | 0.06 | 0.06 |
| restaurants | 0.09 | 0.06 | 0.10 | 0.10 | 1.00 | 0.42 | 0.02 | 0.21 | 0.20 | 0.18 | 0.12 | 0.18 | 0.18 | 0.17 | 0.16 | 0.15 | 0.14 | 0.18 | 0.13 | 0.10 | 0.08 | 0.12 | 0.12 | 0.11 | 0.11 | 0.18 | 0.13 | 0.03 | 0.02 | 0.06 | 0.09 | 0.09 | 0.00 | 0.09 | 0.18 | 0.08 | 0.08 | 0.08 | 0.08 | 0.01 | 0.00 | 0.07 | 0.07 | 0.10 | 0.07 | 0.07 | 0.07 | 0.01 | 0.06 | 0.00 | 0.06 | 0.08 | 0.06 | 0.06 | 0.06 | 0.06 | 0.01 | 0.01 | 0.06 | 0.05 | 0.00 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.11 | 0.05 | 0.04 | 0.05 | 0.04 | 0.10 | 0.04 | 0.09 | 0.04 | 0.00 | 0.04 | 0.01 | 0.53 | 0.52 | 0.06 | 0.03 | 0.04 | 0.45 | 0.11 | 0.35 | 0.52 | 0.30 | 0.04 | 0.62 | 0.03 | 0.10 | 0.35 | 0.56 | 0.15 | 0.03 | 0.05 | 0.06 | 0.09 | 0.05 | 0.01 | 0.04 | 0.03 | 0.05 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| AcceptsInsurance | 0.15 | 0.01 | 0.00 | 0.01 | 0.05 | 0.03 | 0.01 | 0.01 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | 0.01 | 0.05 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.03 | 0.00 | 0.01 | 0.01 | 0.03 | 0.10 | 0.01 | 0.03 | 0.01 | 0.02 | 0.03 | 0.00 | 0.02 | 0.01 | 0.03 | 0.01 | 0.01 | 0.02 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.01 | 0.01 |
| RestaurantsCounterService | 0.00 | 0.02 | 0.14 | 0.01 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | 0.00 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.29 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.15 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.01 | 0.01 | 0.01 | 0.01 | 0.02 | 0.00 | 0.00 | 0.06 | 0.00 | 1.00 | 0.00 | 0.01 | 0.00 |
| AgesAllowed | 0.00 | 0.02 | 0.00 | 0.02 | 0.04 | 0.04 | 0.06 | 0.08 | 0.01 | 0.01 | 0.02 | 0.10 | 0.00 | 0.01 | 0.01 | 0.00 | 0.01 | 0.01 | 0.00 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.01 | 0.00 | 0.00 | 0.02 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.02 | 0.01 | 0.02 | 0.01 | 0.03 | 0.01 | 0.01 | 0.03 | 0.02 | 0.02 | 0.01 | 0.07 | 0.01 | 0.01 | 0.02 | 0.03 | 0.00 | 0.05 | 0.06 | 0.00 | 0.00 | 0.00 | 1.00 | 0.02 | 0.01 |
| Premium_Parking | 0.13 | 0.05 | 0.24 | 0.06 | 0.03 | 0.05 | 0.13 | 0.03 | 0.06 | 0.05 | 0.03 | 0.06 | 0.12 | 0.08 | 0.02 | 0.02 | 0.02 | 0.03 | 0.07 | 0.02 | 0.02 | 0.02 | 0.02 | 0.03 | 0.07 | 0.02 | 0.04 | 0.03 | 0.10 | 0.01 | 0.02 | 0.03 | 0.03 | 0.04 | 0.10 | 0.04 | 0.03 | 0.08 | 0.02 | 0.07 | 0.00 | 0.03 | 0.01 | 0.01 | 0.03 | 0.02 | 0.02 | 0.02 | 0.00 | 0.00 | 0.02 | 0.02 | 0.02 | 0.04 | 0.01 | 0.01 | 0.01 | 0.02 | 0.02 | 0.01 | 0.03 | 0.02 | 0.01 | 0.04 | 0.03 | 0.02 | 0.04 | 0.02 | 0.01 | 0.01 | 0.06 | 0.02 | 0.00 | 0.01 | 0.01 | 0.01 | 0.00 | 0.01 | 0.02 | 0.13 | 0.23 | 0.32 | 0.28 | 0.14 | 0.17 | 0.07 | 0.24 | 0.28 | 0.22 | 0.13 | 0.22 | 0.12 | 0.15 | 0.13 | 0.21 | 0.16 | 0.05 | 0.11 | 0.10 | 0.05 | 0.01 | 0.01 | 0.02 | 1.00 | 0.71 |
| Non_Premium_Parking | 0.11 | 0.04 | 0.16 | 0.06 | 0.05 | 0.06 | 0.09 | 0.00 | 0.09 | 0.05 | 0.03 | 0.08 | 0.09 | 0.10 | 0.00 | 0.01 | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03 | 0.05 | 0.03 | 0.07 | 0.02 | 0.05 | 0.04 | 0.04 | 0.00 | 0.03 | 0.03 | 0.05 | 0.04 | 0.10 | 0.03 | 0.05 | 0.05 | 0.03 | 0.04 | 0.00 | 0.03 | 0.02 | 0.01 | 0.03 | 0.02 | 0.03 | 0.04 | 0.00 | 0.01 | 0.02 | 0.05 | 0.02 | 0.03 | 0.00 | 0.02 | 0.00 | 0.02 | 0.03 | 0.00 | 0.03 | 0.02 | 0.02 | 0.03 | 0.02 | 0.01 | 0.04 | 0.02 | 0.02 | 0.00 | 0.02 | 0.02 | 0.00 | 0.01 | 0.02 | 0.01 | 0.00 | 0.01 | 0.01 | 0.14 | 0.25 | 0.33 | 0.31 | 0.09 | 0.18 | 0.09 | 0.26 | 0.27 | 0.28 | 0.14 | 0.21 | 0.09 | 0.18 | 0.14 | 0.22 | 0.16 | 0.06 | 0.15 | 0.08 | 0.07 | 0.01 | 0.00 | 0.01 | 0.71 | 1.00 |
105 rows × 105 columns
The cramers v also ilustrates us the correlation between the vegan and vegetarian variables as we have seen before. Also we can see the correlation between city, state and postal_code as they all represent location.
Not a high correlation but some information a cramers v can yield is the correlation between the aforementioned location variables and the types of restaurant. For example the cajun/creole variable is for restaurants with West Affrican, French and Spanish roots typical from Loissiana.
Since most of the restaurants with this cusine will be in a reduced number of postal codes or cities, they have a high correlation (compared to other type of food variables) with the location variables.
Pearson
Finally we do a spearman correlation between the continous numerical variables, in case we can obtain any additional information than with the pearson correlation. For this, as before, we set a new dataframe with only the continous numerical variables.
df_continous_vairables = restaurants_income[['latitude', 'longitude', 'review_count', 'num_days_open', 'num_hours_open', 'average_income']]
df_continous_vairables.head()
| latitude | longitude | review_count | num_days_open | num_hours_open | average_income | |
|---|---|---|---|---|---|---|
| business_id | ||||||
| MTSW4McQd7CbVtyjqoe9mw | 39.955505 | -75.155564 | 80 | 7.0 | 94.0 | 57735.0 |
| mWMc6_wTdE0EUBKIGXDVfA | 40.338183 | -75.471659 | 13 | 5.0 | 40.0 | 71196.0 |
| CF33F8-E6oudUQ46HnavjQ | 36.269593 | -87.058943 | 6 | 6.0 | 66.0 | 54097.0 |
| k0hlBqXX-Bt0vf1op7Jr1w | 38.565165 | -90.321087 | 19 | 7.0 | 63.0 | 58867.0 |
| bBDDEgkFA1Otx9Lfe7BZUQ | 36.208102 | -86.768170 | 10 | 6.0 | 76.0 | 42973.0 |
get_corr_matrix(dataset = df_continous_vairables,
metodo = 'pearson', size_figure = [10,8])
df_continous_vairables.corr(method = 'pearson')
| latitude | longitude | review_count | num_days_open | num_hours_open | average_income | |
|---|---|---|---|---|---|---|
| latitude | 1.000000 | 0.132024 | -0.034543 | -0.005041 | -0.009474 | 0.339397 |
| longitude | 0.132024 | 1.000000 | -0.038454 | 0.018553 | -0.000583 | 0.200906 |
| review_count | -0.034543 | -0.038454 | 1.000000 | -0.058043 | -0.081420 | -0.013200 |
| num_days_open | -0.005041 | 0.018553 | -0.058043 | 1.000000 | 0.474899 | 0.003031 |
| num_hours_open | -0.009474 | -0.000583 | -0.081420 | 0.474899 | 1.000000 | -0.018921 |
| average_income | 0.339397 | 0.200906 | -0.013200 | 0.003031 | -0.018921 | 1.000000 |
Ending with the correlation study, the continous variables of the dataset, does not yield any high association. Number of days and hours open have a 0.47 correlation as is natural. Also, a 0.33 and 0.2 correlation between average_income and latidude and longitude respectively, as average_income depends on postal_code, hence its correlation.
We will not plot the name, address, latitude and longitude as every row is unique. Also due to the reduced number of variables city, postal_code, review_count, average_income and num_hours_open will not be included in the for used with the plot_feature function. We will perform visualization on these outside the for.
plot_variables_df = restaurants_income.drop(columns=['name', 'address', 'latitude', 'longitude',
'city', 'postal_code', 'review_count',
'average_income', 'num_hours_open'])
plot_variables_df.head()
| state | stars | is_open | restaurants | food | bars | sandwiches | american (traditional) | pizza | coffee & tea | breakfast & brunch | american (new) | fast food | burgers | mexican | italian | specialty food | seafood | desserts | bakeries | salad | chinese | cafes | chicken wings | ice cream & frozen yogurt | beer | sports bars | cocktail bars | juice bars & smoothies | barbeque | japanese | pubs | sushi bars | local flavor | asian fusion | diners | steakhouses | mediterranean | wine bars | southern | cajun/creole | donuts | tacos | soup | thai | beer bar | tex-mex | greek | breweries | vietnamese | chicken shop | hot dogs | indian | bagels | comfort food | cheesesteaks | ethnic food | caribbean | middle eastern | french | soul food | buffets | gastropubs | fruits & veggies | korean | bubble tea | tapas/small plates | food stands | cupcakes | noodles | wineries | cuban | spanish | acai bowls | RestaurantsDelivery | OutdoorSeating | BusinessAcceptsCreditCards | BikeParking | RestaurantsPriceRange2 | RestaurantsTakeOut | ByAppointmentOnly | WiFi | Alcohol | Caters | WheelchairAccessible | RestaurantsReservations | CoatCheck | DogsAllowed | RestaurantsTableService | HasTV | HappyHour | DriveThru | BusinessAcceptsBitcoin | Smoking | BYOBCorkage | AcceptsInsurance | RestaurantsCounterService | AgesAllowed | Premium_Parking | Non_Premium_Parking | dairy-free | gluten-free | vegan | kosher | halal | soy-free | vegetarian | num_days_open | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| business_id | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| MTSW4McQd7CbVtyjqoe9mw | PA | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False | False | False | True | 1 | True | False | free | False | True | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | False | True | False | False | False | False | False | False | False | 7.0 |
| mWMc6_wTdE0EUBKIGXDVfA | PA | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | unknown | unknown | True | True | 2 | True | unknown | unknown | unknown | False | True | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | False | True | False | False | False | False | False | False | False | 5.0 |
| CF33F8-E6oudUQ46HnavjQ | TN | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | True | True | True | False | 1 | True | False | False | False | False | True | False | False | False | False | True | False | True | unknown | unknown | unknown | unknown | unknown | unknown | False | False | False | False | False | False | False | False | False | 6.0 |
| k0hlBqXX-Bt0vf1op7Jr1w | MO | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False | True | True | unknown | 1 | True | unknown | free | full_bar | True | unknown | False | unknown | unknown | unknown | True | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | False | True | False | False | False | False | False | False | False | 7.0 |
| bBDDEgkFA1Otx9Lfe7BZUQ | TN | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | True | True | True | unknown | 1 | True | False | False | False | False | True | False | False | False | False | True | False | True | unknown | unknown | unknown | unknown | unknown | unknown | False | False | False | False | False | False | False | False | False | 6.0 |
for i in list(plot_variables_df.columns):
if (plot_variables_df[i].dtype == float) & (i != 'stars'):
plot_feature(plot_variables_df, col_name = i, isContinuous = True, target = 'stars')
elif i != 'stars':
plot_feature(plot_variables_df, col_name = i, isContinuous = False, target = 'stars')
Despite the amount of information provided, it migh not seem nothing to highlight, since the distributions are more or less balanced. But taken a closer look we obtain a few insights.
On one hand, and maybe as one can imagine, fast food business or those who have drive through have more probabilities of having a rating lower than 4 stars (around 80%) as expected. On the other hand, those business that have the attribute of food stands or bubble tea establishments have around 75-80% of 4 stars or more.
Some exceptions might be that all hahal tagged business have four or more stars, meanwhile all soy-free tagged business are considered as unsucessfull, as well as those that does not have counter service.
Average income vs stars
# Create a figure and two subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
# Use the 'review_count' column with values of the 'stars' column equal to 0
sns.distplot(restaurants_income[restaurants_income['stars'] == 0]['average_income'],
bins=20, ax=axs[0])
axs[0].set_title("Distribution for 'stars' = 0")
# Use the 'review_count' column with values of the 'stars' column equal to 1
sns.distplot(restaurants_income[restaurants_income['stars'] == 1]['average_income'],
bins=20, color='darkorange', ax=axs[1])
axs[1].set_title("Distribution for 'stars' = 1")
# Show the plots
plt.show()
Review count vs stars
# Create a figure and two subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
# Use the 'review_count' column with values of the 'stars' column equal to 0
sns.distplot(restaurants_income[restaurants_income['stars'] == 0]['review_count'].apply(np.log1p),
bins=10, ax=axs[0])
axs[0].set_title("Distribution for 'stars' = 0")
# Use the 'review_count' column with values of the 'stars' column equal to 1
sns.distplot(restaurants_income[restaurants_income['stars'] == 1]['review_count'].apply(np.log1p),
bins=10, color='darkorange', ax=axs[1])
axs[1].set_title("Distribution for 'stars' = 1")
# Show the plots
plt.show()
Number of hours open vs stars
# Create a figure and two subplots
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
# Use the 'review_count' column with values of the 'stars' column equal to 0
sns.distplot(restaurants_income[restaurants_income['stars'] == 0]['num_hours_open'],
bins=10, ax=axs[0])
axs[0].set_title("Distribution for 'stars' = 0")
# Use the 'review_count' column with values of the 'stars' column equal to 1
sns.distplot(restaurants_income[restaurants_income['stars'] == 1]['num_hours_open'],
bins=10, color='darkorange', ax=axs[1])
axs[1].set_title("Distribution for 'stars' = 1")
# Show the plots
plt.show()
For these three variables we have plot separately we do not observe a huge variation between the densities of 1 and 0 values from the target variable. No further comments.
In order to process our data we will encode our data, giving them numeric values and normalizate the whole dataset in such a way the highest number of each variable is 1 and the lowest 0.
But first, we will drop the following variables: name', 'address', 'latitude', 'longitude' and 'city'. This is due to they are mostly unique values if not unique. Hence, they will not provide useful information to our models.
restaurants_income = restaurants_income.drop(columns=['name', 'address', 'latitude', 'longitude',
'city'])
restaurants_income.head()
| state | postal_code | stars | review_count | is_open | restaurants | food | bars | sandwiches | american (traditional) | pizza | coffee & tea | breakfast & brunch | american (new) | fast food | burgers | mexican | italian | specialty food | seafood | desserts | bakeries | salad | chinese | cafes | chicken wings | ice cream & frozen yogurt | beer | sports bars | cocktail bars | juice bars & smoothies | barbeque | japanese | pubs | sushi bars | local flavor | asian fusion | diners | steakhouses | mediterranean | wine bars | southern | cajun/creole | donuts | tacos | soup | thai | beer bar | tex-mex | greek | breweries | vietnamese | chicken shop | hot dogs | indian | bagels | comfort food | cheesesteaks | ethnic food | caribbean | middle eastern | french | soul food | buffets | gastropubs | fruits & veggies | korean | bubble tea | tapas/small plates | food stands | cupcakes | noodles | wineries | cuban | spanish | acai bowls | RestaurantsDelivery | OutdoorSeating | BusinessAcceptsCreditCards | BikeParking | RestaurantsPriceRange2 | RestaurantsTakeOut | ByAppointmentOnly | WiFi | Alcohol | Caters | WheelchairAccessible | RestaurantsReservations | CoatCheck | DogsAllowed | RestaurantsTableService | HasTV | HappyHour | DriveThru | BusinessAcceptsBitcoin | Smoking | BYOBCorkage | AcceptsInsurance | RestaurantsCounterService | AgesAllowed | Premium_Parking | Non_Premium_Parking | dairy-free | gluten-free | vegan | kosher | halal | soy-free | vegetarian | num_days_open | num_hours_open | average_income | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| business_id | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| MTSW4McQd7CbVtyjqoe9mw | PA | 19107 | 1 | 80 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False | False | False | True | 1 | True | False | free | False | True | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | False | True | False | False | False | False | False | False | False | 7.0 | 94.0 | 57735.0 |
| mWMc6_wTdE0EUBKIGXDVfA | PA | 18054 | 1 | 13 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | unknown | unknown | True | True | 2 | True | unknown | unknown | unknown | False | True | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | False | True | False | False | False | False | False | False | False | 5.0 | 40.0 | 71196.0 |
| CF33F8-E6oudUQ46HnavjQ | TN | 37015 | 0 | 6 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | True | True | True | False | 1 | True | False | False | False | False | True | False | False | False | False | True | False | True | unknown | unknown | unknown | unknown | unknown | unknown | False | False | False | False | False | False | False | False | False | 6.0 | 66.0 | 54097.0 |
| k0hlBqXX-Bt0vf1op7Jr1w | MO | 63123 | 0 | 19 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | False | True | True | unknown | 1 | True | unknown | free | full_bar | True | unknown | False | unknown | unknown | unknown | True | unknown | unknown | unknown | unknown | unknown | unknown | unknown | unknown | False | True | False | False | False | False | False | False | False | 7.0 | 63.0 | 58867.0 |
| bBDDEgkFA1Otx9Lfe7BZUQ | TN | 37207 | 0 | 10 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | True | True | True | unknown | 1 | True | False | False | False | False | True | False | False | False | False | True | False | True | unknown | unknown | unknown | unknown | unknown | unknown | False | False | False | False | False | False | False | False | False | 6.0 | 76.0 | 42973.0 |
The first encoding we will perform is to all attributes variables except for the ones dropped below. These variables have unique values of True, False and Unknown.
selected_columns = restaurants_income.iloc[:, 76:-3].drop(
columns=['WiFi', 'Alcohol', 'Smoking', 'BYOBCorkage','AgesAllowed',
'RestaurantsPriceRange2', 'Premium_Parking','Non_Premium_Parking'])
selected_columns = selected_columns.apply(lambda x: x.astype(str))
We encode True as 1, False as 0 and unknown as 0.5
selected_columns = selected_columns.applymap(
lambda x: 1 if x == "True" else (0 if x == "False" else 0.5))
The parking variables had unique values of True, False and 0. In this case we will transform the True in 1 and the rest 0.
parking_columns = restaurants_income[['Premium_Parking','Non_Premium_Parking']]
parking_columns = parking_columns.applymap(
lambda x: 1 if x == "True" else 0)
WiFi, had unique values of Free, Paid, Unknown and False. It seemed reasonable to stablish all the unknowns as 0.5.
Hence, starting from there, False is 0. We are left witn free and paid. Being free the best option we have decided to provide it with a value of 1 and paid with a vaule between free and unknown, 0.75.
restaurants_income['WiFi'] = restaurants_income['WiFi'].apply(
lambda x: 1 if x == "free" else (0.75 if x == "paid" else (0.5 if x =="unknown" else 0)))
Facing the encoding of Alcohol, Smoking and BYOBCorkage, we faced a similar dilemma than with WiFi. We have decided to mantain the 0 for False, 0.5 for Unknown and give the less restrictive option a 1 (full_bar, yes and yes_free respectively) and 0.75 for the more restrictive option without being False (beer_and_wine, outdoor and yes_corkage).
restaurants_income['Alcohol'] = restaurants_income['Alcohol'].apply(
lambda x: 1 if x == "full_bar" else (0.75 if x == "beer_and_wine" else (
0.5 if x =="unknown" else 0)))
restaurants_income['Smoking'] = restaurants_income['Smoking'].apply(
lambda x: 1 if x == "yes" else (0.75 if x == "beer_and_wine" else (
0.5 if x =="unknown" else 0)))
restaurants_income['BYOBCorkage'] = restaurants_income['BYOBCorkage'].apply(
lambda x: 1 if x == "yes_free" else (0.75 if x == "yes_corkage" else (
0.5 if x =="unknown" else 0)))
In order to encode the AgesAllowed variable, we confront a different situation, here we do not have false nor true. Instead the extreme values are "allages" and "21plus". Since 21plus is the more restrictive we have give it a value of 0 and 1 to allages.
As in the rest of the variables, "unknown" stays at 0.5
The final value to encode will be "18plus". Here our decision has been to give it a 0.25, since its value resembles more to the 21plus rather than all ages.
restaurants_income['AgesAllowed'] = restaurants_income['AgesAllowed'].apply(
lambda x: 1 if x == "allages" else (0.5 if x == "unknown" else (
0.25 if x =="18plus" else 0)))
Now we update the values in restaurants_income of the variables stored in selected_columns as well as in parking_columns
restaurants_income = restaurants_income.drop(
columns = ['RestaurantsDelivery', 'OutdoorSeating', 'BusinessAcceptsCreditCards',
'BikeParking', 'RestaurantsTakeOut', 'ByAppointmentOnly', 'Caters',
'WheelchairAccessible', 'RestaurantsReservations', 'CoatCheck',
'DogsAllowed', 'RestaurantsTableService', 'HasTV', 'HappyHour',
'DriveThru', 'BusinessAcceptsBitcoin', 'AcceptsInsurance',
'RestaurantsCounterService', 'dairy-free', 'gluten-free', 'vegan',
'kosher', 'halal', 'soy-free', 'vegetarian', 'Premium_Parking','Non_Premium_Parking'])
restaurants_income = restaurants_income.join(selected_columns, how="left")
restaurants_income.head()
| state | postal_code | stars | review_count | is_open | restaurants | food | bars | sandwiches | american (traditional) | pizza | coffee & tea | breakfast & brunch | american (new) | fast food | burgers | mexican | italian | specialty food | seafood | desserts | bakeries | salad | chinese | cafes | chicken wings | ice cream & frozen yogurt | beer | sports bars | cocktail bars | juice bars & smoothies | barbeque | japanese | pubs | sushi bars | local flavor | asian fusion | diners | steakhouses | mediterranean | wine bars | southern | cajun/creole | donuts | tacos | soup | thai | beer bar | tex-mex | greek | breweries | vietnamese | chicken shop | hot dogs | indian | bagels | comfort food | cheesesteaks | ethnic food | caribbean | middle eastern | french | soul food | buffets | gastropubs | fruits & veggies | korean | bubble tea | tapas/small plates | food stands | cupcakes | noodles | wineries | cuban | spanish | acai bowls | RestaurantsPriceRange2 | WiFi | Alcohol | Smoking | BYOBCorkage | AgesAllowed | num_days_open | num_hours_open | average_income | RestaurantsDelivery | OutdoorSeating | BusinessAcceptsCreditCards | BikeParking | RestaurantsTakeOut | ByAppointmentOnly | Caters | WheelchairAccessible | RestaurantsReservations | CoatCheck | DogsAllowed | RestaurantsTableService | HasTV | HappyHour | DriveThru | BusinessAcceptsBitcoin | AcceptsInsurance | RestaurantsCounterService | dairy-free | gluten-free | vegan | kosher | halal | soy-free | vegetarian | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| business_id | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| MTSW4McQd7CbVtyjqoe9mw | PA | 19107 | 1 | 80 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1.0 | 0.0 | 0.5 | 0.5 | 0.5 | 7.0 | 94.0 | 57735.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| mWMc6_wTdE0EUBKIGXDVfA | PA | 18054 | 1 | 13 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 5.0 | 40.0 | 71196.0 | 0.5 | 0.5 | 1.0 | 1.0 | 1.0 | 0.5 | 0.0 | 1.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CF33F8-E6oudUQ46HnavjQ | TN | 37015 | 0 | 6 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.5 | 0.5 | 0.5 | 6.0 | 66.0 | 54097.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.5 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| k0hlBqXX-Bt0vf1op7Jr1w | MO | 63123 | 0 | 19 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1.0 | 1.0 | 0.5 | 0.5 | 0.5 | 7.0 | 63.0 | 58867.0 | 0.0 | 1.0 | 1.0 | 0.5 | 1.0 | 0.5 | 1.0 | 0.5 | 0.0 | 0.5 | 0.5 | 0.5 | 1.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| bBDDEgkFA1Otx9Lfe7BZUQ | TN | 37207 | 0 | 10 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.5 | 0.5 | 0.5 | 6.0 | 76.0 | 42973.0 | 1.0 | 1.0 | 1.0 | 0.5 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.5 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
restaurants_income = restaurants_income.join(parking_columns, how="left")
restaurants_income.head()
| state | postal_code | stars | review_count | is_open | restaurants | food | bars | sandwiches | american (traditional) | pizza | coffee & tea | breakfast & brunch | american (new) | fast food | burgers | mexican | italian | specialty food | seafood | desserts | bakeries | salad | chinese | cafes | chicken wings | ice cream & frozen yogurt | beer | sports bars | cocktail bars | juice bars & smoothies | barbeque | japanese | pubs | sushi bars | local flavor | asian fusion | diners | steakhouses | mediterranean | wine bars | southern | cajun/creole | donuts | tacos | soup | thai | beer bar | tex-mex | greek | breweries | vietnamese | chicken shop | hot dogs | indian | bagels | comfort food | cheesesteaks | ethnic food | caribbean | middle eastern | french | soul food | buffets | gastropubs | fruits & veggies | korean | bubble tea | tapas/small plates | food stands | cupcakes | noodles | wineries | cuban | spanish | acai bowls | RestaurantsPriceRange2 | WiFi | Alcohol | Smoking | BYOBCorkage | AgesAllowed | num_days_open | num_hours_open | average_income | RestaurantsDelivery | OutdoorSeating | BusinessAcceptsCreditCards | BikeParking | RestaurantsTakeOut | ByAppointmentOnly | Caters | WheelchairAccessible | RestaurantsReservations | CoatCheck | DogsAllowed | RestaurantsTableService | HasTV | HappyHour | DriveThru | BusinessAcceptsBitcoin | AcceptsInsurance | RestaurantsCounterService | dairy-free | gluten-free | vegan | kosher | halal | soy-free | vegetarian | Premium_Parking | Non_Premium_Parking | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| business_id | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| MTSW4McQd7CbVtyjqoe9mw | PA | 19107 | 1 | 80 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1.0 | 0.0 | 0.5 | 0.5 | 0.5 | 7.0 | 94.0 | 57735.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| mWMc6_wTdE0EUBKIGXDVfA | PA | 18054 | 1 | 13 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 5.0 | 40.0 | 71196.0 | 0.5 | 0.5 | 1.0 | 1.0 | 1.0 | 0.5 | 0.0 | 1.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| CF33F8-E6oudUQ46HnavjQ | TN | 37015 | 0 | 6 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.5 | 0.5 | 0.5 | 6.0 | 66.0 | 54097.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.5 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| k0hlBqXX-Bt0vf1op7Jr1w | MO | 63123 | 0 | 19 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1.0 | 1.0 | 0.5 | 0.5 | 0.5 | 7.0 | 63.0 | 58867.0 | 0.0 | 1.0 | 1.0 | 0.5 | 1.0 | 0.5 | 1.0 | 0.5 | 0.0 | 0.5 | 0.5 | 0.5 | 1.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| bBDDEgkFA1Otx9Lfe7BZUQ | TN | 37207 | 0 | 10 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.5 | 0.5 | 0.5 | 6.0 | 76.0 | 42973.0 | 1.0 | 1.0 | 1.0 | 0.5 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.5 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Finally we are left with 2 columns that represent location: state and postal_code. Despite postal_code yields more information the number of unique values is more than 80 times more than state. Our intention is create dummy variables and include more than 600 new variables does not seem something reasonal.
Moreover, in the case we drop postal_code, we still would be able to differenciate between different states, as the average_income variable depends on the zip code. For these reason we will drop the variable postal_code and we will create dummies for the state variable.
restaurants_income = restaurants_income.drop(columns='postal_code')
restaurants_income["state"].unique()
array(['PA', 'TN', 'MO', 'FL', 'AZ', 'LA', 'CA', 'ID'], dtype=object)
dummy_vars = pd.get_dummies(restaurants_income['state'], prefix='state')
dummy_vars.head()
| state_AZ | state_CA | state_FL | state_ID | state_LA | state_MO | state_PA | state_TN | |
|---|---|---|---|---|---|---|---|---|
| business_id | ||||||||
| MTSW4McQd7CbVtyjqoe9mw | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| mWMc6_wTdE0EUBKIGXDVfA | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| CF33F8-E6oudUQ46HnavjQ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| k0hlBqXX-Bt0vf1op7Jr1w | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| bBDDEgkFA1Otx9Lfe7BZUQ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
FInally we add them to the rest of the dataset.
restaurants_income = pd.concat([restaurants_income.drop(columns='state'), dummy_vars], axis=1)
restaurants_income.head()
| stars | review_count | is_open | restaurants | food | bars | sandwiches | american (traditional) | pizza | coffee & tea | breakfast & brunch | american (new) | fast food | burgers | mexican | italian | specialty food | seafood | desserts | bakeries | salad | chinese | cafes | chicken wings | ice cream & frozen yogurt | beer | sports bars | cocktail bars | juice bars & smoothies | barbeque | japanese | pubs | sushi bars | local flavor | asian fusion | diners | steakhouses | mediterranean | wine bars | southern | cajun/creole | donuts | tacos | soup | thai | beer bar | tex-mex | greek | breweries | vietnamese | chicken shop | hot dogs | indian | bagels | comfort food | cheesesteaks | ethnic food | caribbean | middle eastern | french | soul food | buffets | gastropubs | fruits & veggies | korean | bubble tea | tapas/small plates | food stands | cupcakes | noodles | wineries | cuban | spanish | acai bowls | RestaurantsPriceRange2 | WiFi | Alcohol | Smoking | BYOBCorkage | AgesAllowed | num_days_open | num_hours_open | average_income | RestaurantsDelivery | OutdoorSeating | BusinessAcceptsCreditCards | BikeParking | RestaurantsTakeOut | ByAppointmentOnly | Caters | WheelchairAccessible | RestaurantsReservations | CoatCheck | DogsAllowed | RestaurantsTableService | HasTV | HappyHour | DriveThru | BusinessAcceptsBitcoin | AcceptsInsurance | RestaurantsCounterService | dairy-free | gluten-free | vegan | kosher | halal | soy-free | vegetarian | Premium_Parking | Non_Premium_Parking | state_AZ | state_CA | state_FL | state_ID | state_LA | state_MO | state_PA | state_TN | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| business_id | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| MTSW4McQd7CbVtyjqoe9mw | 1 | 80 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1.0 | 0.0 | 0.5 | 0.5 | 0.5 | 7.0 | 94.0 | 57735.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| mWMc6_wTdE0EUBKIGXDVfA | 1 | 13 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 5.0 | 40.0 | 71196.0 | 0.5 | 0.5 | 1.0 | 1.0 | 1.0 | 0.5 | 0.0 | 1.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| CF33F8-E6oudUQ46HnavjQ | 0 | 6 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.5 | 0.5 | 0.5 | 6.0 | 66.0 | 54097.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.5 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| k0hlBqXX-Bt0vf1op7Jr1w | 0 | 19 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1.0 | 1.0 | 0.5 | 0.5 | 0.5 | 7.0 | 63.0 | 58867.0 | 0.0 | 1.0 | 1.0 | 0.5 | 1.0 | 0.5 | 1.0 | 0.5 | 0.0 | 0.5 | 0.5 | 0.5 | 1.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| bBDDEgkFA1Otx9Lfe7BZUQ | 0 | 10 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.0 | 0.0 | 0.5 | 0.5 | 0.5 | 6.0 | 76.0 | 42973.0 | 1.0 | 1.0 | 1.0 | 0.5 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.5 | 0.5 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Since all the data in the dataset left is numeric, in order to normalize the dataset we transform all the columns to type ìnt64
restaurants_income = restaurants_income.apply(lambda x: x.astype('int64'))
In order to scale the variables we will use the MinMaxScaler function form the sklearn library, transforming all the values into a range of 0 and 1
scaler = MinMaxScaler()
restaurants_income_scaled = pd.DataFrame(
scaler.fit_transform(restaurants_income), columns=restaurants_income.columns)
restaurants_income_scaled.head()
| stars | review_count | is_open | restaurants | food | bars | sandwiches | american (traditional) | pizza | coffee & tea | breakfast & brunch | american (new) | fast food | burgers | mexican | italian | specialty food | seafood | desserts | bakeries | salad | chinese | cafes | chicken wings | ice cream & frozen yogurt | beer | sports bars | cocktail bars | juice bars & smoothies | barbeque | japanese | pubs | sushi bars | local flavor | asian fusion | diners | steakhouses | mediterranean | wine bars | southern | cajun/creole | donuts | tacos | soup | thai | beer bar | tex-mex | greek | breweries | vietnamese | chicken shop | hot dogs | indian | bagels | comfort food | cheesesteaks | ethnic food | caribbean | middle eastern | french | soul food | buffets | gastropubs | fruits & veggies | korean | bubble tea | tapas/small plates | food stands | cupcakes | noodles | wineries | cuban | spanish | acai bowls | RestaurantsPriceRange2 | WiFi | Alcohol | Smoking | BYOBCorkage | AgesAllowed | num_days_open | num_hours_open | average_income | RestaurantsDelivery | OutdoorSeating | BusinessAcceptsCreditCards | BikeParking | RestaurantsTakeOut | ByAppointmentOnly | Caters | WheelchairAccessible | RestaurantsReservations | CoatCheck | DogsAllowed | RestaurantsTableService | HasTV | HappyHour | DriveThru | BusinessAcceptsBitcoin | AcceptsInsurance | RestaurantsCounterService | dairy-free | gluten-free | vegan | kosher | halal | soy-free | vegetarian | Premium_Parking | Non_Premium_Parking | state_AZ | state_CA | state_FL | state_ID | state_LA | state_MO | state_PA | state_TN | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.009917 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000000 | 0.559524 | 0.180427 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.001058 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.333333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.666667 | 0.238095 | 0.237808 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.000132 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.833333 | 0.392857 | 0.164920 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 0.001851 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.000000 | 0.375000 | 0.185253 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.000661 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.833333 | 0.452381 | 0.117501 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
Finally, we take our data and export it into a csv under the path "../data/processed/". This will be from where we will import our data in the two final notebooks.
restaurants_income_scaled.to_csv('../data/processed/restaurants_models.csv', index=False)